3.1. The UC Irvine Machine Learning Repository1 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
kable(summary(Glass[,1:5]))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))| RI | Na | Mg | Al | Si | |
|---|---|---|---|---|---|
| Min. :1.511 | Min. :10.73 | Min. :0.000 | Min. :0.290 | Min. :69.81 | |
| 1st Qu.:1.517 | 1st Qu.:12.91 | 1st Qu.:2.115 | 1st Qu.:1.190 | 1st Qu.:72.28 | |
| Median :1.518 | Median :13.30 | Median :3.480 | Median :1.360 | Median :72.79 | |
| Mean :1.518 | Mean :13.41 | Mean :2.685 | Mean :1.445 | Mean :72.65 | |
| 3rd Qu.:1.519 | 3rd Qu.:13.82 | 3rd Qu.:3.600 | 3rd Qu.:1.630 | 3rd Qu.:73.09 | |
| Max. :1.534 | Max. :17.38 | Max. :4.490 | Max. :3.500 | Max. :75.41 |
kable(summary(Glass[,6:10]))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))| K | Ca | Ba | Fe | Type | |
|---|---|---|---|---|---|
| Min. :0.0000 | Min. : 5.430 | Min. :0.000 | Min. :0.00000 | 1:70 | |
| 1st Qu.:0.1225 | 1st Qu.: 8.240 | 1st Qu.:0.000 | 1st Qu.:0.00000 | 2:76 | |
| Median :0.5550 | Median : 8.600 | Median :0.000 | Median :0.00000 | 3:17 | |
| Mean :0.4971 | Mean : 8.957 | Mean :0.175 | Mean :0.05701 | 5:13 | |
| 3rd Qu.:0.6100 | 3rd Qu.: 9.172 | 3rd Qu.:0.000 | 3rd Qu.:0.10000 | 6: 9 | |
| Max. :6.2100 | Max. :16.190 | Max. :3.150 | Max. :0.51000 | 7:29 |
There 9 numerical predictors (RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe) and one categorical target with 6 levels (Type). The data set is highly skewed toward categories 1 and 2 with 146 out of 214 observations falling into those two categories alone. Ba, and Fe have their first quartile value at 0 indicating that they have a lot of zeros in their distributions.
Glass[,1:9] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")Histograms of the predictors show that all of them except maybe one, Si are moderately to highly skewed. The histogram below has been colored by the target categories and shows that the distributions for different targets values are slightly different (especially for Al and Na) indicating that those variables may be good predictors.
Glass %>%
gather(-Type, key = "var", value = "val") %>%
ggplot(aes(x = val, fill=Type)) +
geom_histogram(bins=10, alpha=1) +
facet_wrap(~ var, scales = "free") +
scale_fill_manual("target",
values = c('#d73027','#fc8d59','#fee090','#e0f3f8','#91bfdb','#4575b4')) +
xlab("") +
ylab("") +
theme(panel.background = element_blank(), legend.position="top")Those differences in the distributions of the predictors when separated by the target variable are even more evident in the bar plots below. All of the predictors show a lot of variation in their distributions for each target value. We can also see a lot of outliers which are indicated by the red dots to the top and bottom of the plot ‘whiskers’.
Glass %>%
gather(-Type,key = "var", value = "val") %>%
ggplot(aes(x=factor(Type), y=val)) +
geom_boxplot(width=.5, fill="#58BFFF", outlier.colour="red", outlier.size = 1) +
stat_summary(aes(colour="mean"), fun.y=mean, geom="point",
size=2, show.legend=TRUE) +
stat_summary(aes(colour="median"), fun.y=median, geom="point",
size=2, show.legend=TRUE) +
facet_wrap(~ var, scales = "free", ncol=3) +
labs(colour="Statistics", x="", y="") +
scale_colour_manual(values=c("#9900FF", "#3300FF")) +
theme(panel.background=element_blank())Only RI and Ca seem to be very highly correlated with each other with a correlation of about 0.81.
Glass %>%
ggscatmat(color="Type", alpha=0.6) +
scale_color_manual(values=c('#d73027','#fc8d59','#fee090','#e0f3f8','#91bfdb','#4575b4')) +
theme(panel.background=element_blank(), legend.position="top",
axis.text.x = element_text(angle=-40, vjust=1, hjust=0))## [1] 7
The findCorrelation function suggests removing column 7 which corresponds to Ca.
There are no missing values in the dataset.
missing <- data.frame(t(apply(is.na(Glass), 2, sum)))
# kable(missing, col.names = "NA's")
row.names(missing) <- c("NA's")
kable(missing) %>%
kable_styling(bootstrap_options = c("condensed"))| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type | |
|---|---|---|---|---|---|---|---|---|---|---|
| NA’s | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
As noted above most if not all of the predictors have both outliers and skewed distributions as well as a large percentage of zeroes in a few predictors.
skewness_statistic <- apply(Glass[,1:9], 2, skewness)
ratio_max_to_min <- apply(Glass[,1:9], 2, max)/apply(Glass[,1:9], 2, min)
kable(data.frame(skewness_statistic, ratio_max_to_min)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| skewness_statistic | ratio_max_to_min | |
|---|---|---|
| RI | 1.6027151 | 1.015075 |
| Na | 0.4478343 | 1.619758 |
| Mg | -1.1364523 | Inf |
| Al | 0.8946104 | 12.068965 |
| Si | -0.7202392 | 1.080218 |
| K | 6.4600889 | Inf |
| Ca | 2.0184463 | 2.981584 |
| Ba | 3.3686800 | Inf |
| Fe | 1.7298107 | Inf |
Because there are so many zero values in the data it’s hard to estimate the severity of the skew using the max to min ratio since division by zero is undefined. But we clearly have some large skewness values and can clearly see the skew in the histograms.
The skewed distributions might be improved by log, square root, or inverse transformations or a Box-Cox transformation.
Unfortunately because of the zero values Mg, K, Ba, and Fe cannot be transformed using Box-Cox. I’m still not sure how to handle that although I found this post by Rob Hyndman on the subject… Transforming data with zeros.
# https://stackoverflow.com/questions/46485024/how-to-use-boxcoxtrans-function-in-r
GlassBC <- apply(Glass[,1:9], 2, BoxCoxTrans)
trans_Glass <- purrr::map2(GlassBC, Glass[,1:9], function(x, y) predict(x, y))
trans_Glass_df = as.data.frame(do.call(cbind, trans_Glass))
kable(head(trans_Glass_df)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|---|---|
| 0.2838746 | 2.613007 | 4.49 | 0.0976177 | 2575.684 | 0.06 | 0.8254539 | 0 | 0.00 |
| 0.2829051 | 2.631169 | 3.60 | 0.3323808 | 2644.326 | 0.48 | 0.8145827 | 0 | 0.00 |
| 0.2824954 | 2.604909 | 3.55 | 0.4819347 | 2663.270 | 0.39 | 0.8139144 | 0 | 0.00 |
| 0.2829194 | 2.580974 | 3.69 | 0.2715633 | 2635.606 | 0.57 | 0.8195032 | 0 | 0.00 |
| 0.2828507 | 2.585506 | 3.62 | 0.2271057 | 2669.843 | 0.55 | 0.8176698 | 0 | 0.00 |
| 0.2824323 | 2.548664 | 3.61 | 0.5455844 | 2661.810 | 0.64 | 0.8176698 | 0 | 0.26 |
trans_Glass_df %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")I’m not seeing a lot of improvement in the histograms after transformation with Box-Cox so let’s try something else… Another post suggests three ways of handling the zeros:
- Add a constant value © to each value of variable then take a log transformation
- Impute zero value with mean
- Take square root instead of log for transformation2
Let’s try the square root transformation first.
# Square Root transformation
Glass_SqRt <- data.frame(apply(Glass[,1:9], 2, sqrt))
Glass_SqRt %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")The square root transformation seems to have done a little bit better at taming our distributions, however they are still highly skewed in some cases.
Let’s try one more of the suggestions by using \(log(x+1)\) to transform our data.
# log(x+1) transformation
Glass_log <- data.frame(apply(Glass[,1:9]+1, 2, log))
Glass_log %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")Maybe a little better, but still not great!
kable(nearZeroVar(Glass, names = TRUE, saveMetrics=TRUE)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| freqRatio | percentUnique | zeroVar | nzv | |
|---|---|---|---|---|
| RI | 1.000000 | 83.177570 | FALSE | FALSE |
| Na | 1.000000 | 66.355140 | FALSE | FALSE |
| Mg | 5.250000 | 43.925234 | FALSE | FALSE |
| Al | 1.333333 | 55.140187 | FALSE | FALSE |
| Si | 1.000000 | 62.149533 | FALSE | FALSE |
| K | 2.500000 | 30.373832 | FALSE | FALSE |
| Ca | 1.000000 | 66.822430 | FALSE | FALSE |
| Ba | 88.000000 | 15.887851 | FALSE | FALSE |
| Fe | 20.571429 | 14.953271 | FALSE | FALSE |
| Type | 1.085714 | 2.803738 | FALSE | FALSE |
The caret function nearZeroVar does not indicate that the variabes with high frequenciees of zeroes should be removed so another solution might be to try the alternative Box-Cox transformation that Hyndman suggested in his blog post.3
Alt_BoxCox_Glass <- log1p(Glass[,1:9])
kable(head(Alt_BoxCox_Glass)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|---|---|
| 0.9246596 | 2.683758 | 1.702928 | 0.7419373 | 4.287441 | 0.0582689 | 2.277267 | 0 | 0.0000000 |
| 0.9233100 | 2.700690 | 1.526056 | 0.8586616 | 4.300410 | 0.3920421 | 2.178155 | 0 | 0.0000000 |
| 0.9227419 | 2.676216 | 1.515127 | 0.9321641 | 4.303930 | 0.3293037 | 2.172476 | 0 | 0.0000000 |
| 0.9233299 | 2.653946 | 1.545433 | 0.8285518 | 4.298781 | 0.4510756 | 2.221375 | 0 | 0.0000000 |
| 0.9232346 | 2.658159 | 1.530395 | 0.8064759 | 4.305146 | 0.4382549 | 2.204972 | 0 | 0.0000000 |
| 0.9226544 | 2.623944 | 1.528228 | 0.9631743 | 4.303660 | 0.4946962 | 2.204972 | 0 | 0.2311117 |
Alt_BoxCox_Glass %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")I think that in this case it mght be necessary to know if these are real zeroes, or ‘censored’ data where the zeroes may actually be trace amounts that are below the detection limit. In this case Kuhn and Johnson state:
when a sample has a value below the limit of detection, the actual limit can be used in place of the real value. For this situation, it is also common to use a random number between zero and the limit of detection.4
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
# library(mlbench)
data(Soybean)
## See ?Soybean for details
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
[,1] Class the 19 classes
[,2] date apr(0),may(1),june(2),july(3),aug(4),sept(5),oct(6).
[,3] plant.stand normal(0),lt-normal(1).
[,4] precip lt-norm(0),norm(1),gt-norm(2).
[,5] temp lt-norm(0),norm(1),gt-norm(2).
[,6] hail yes(0),no(1).
[,7] crop.hist dif-lst-yr(0),s-l-y(1),s-l-2-y(2), s-l-7-y(3).
[,8] area.dam scatter(0),low-area(1),upper-ar(2),whole-field(3).
[,9] sever minor(0),pot-severe(1),severe(2).
[,10] seed.tmt none(0),fungicide(1),other(2).
[,11] germ 90-100%(0),80-89%(1),lt-80%(2).
[,12] plant.growth norm(0),abnorm(1).
[,13] leaves norm(0),abnorm(1).
[,14] leaf.halo absent(0),yellow-halos(1),no-yellow-halos(2).
[,15] leaf.marg w-s-marg(0),no-w-s-marg(1),dna(2).
[,16] leaf.size lt-1/8(0),gt-1/8(1),dna(2).
[,17] leaf.shread absent(0),present(1).
[,18] leaf.malf absent(0),present(1).
[,19] leaf.mild absent(0),upper-surf(1),lower-surf(2).
[,20] stem norm(0),abnorm(1).
[,21] lodging yes(0),no(1).
[,22] stem.cankers absent(0),below-soil(1),above-s(2),ab-sec-nde(3).
[,23] canker.lesion dna(0),brown(1),dk-brown-blk(2),tan(3).
[,24] fruiting.bodies absent(0),present(1).
[,25] ext.decay absent(0),firm-and-dry(1),watery(2).
[,26] mycelium absent(0),present(1).
[,27] int.discolor none(0),brown(1),black(2).
[,28] sclerotia absent(0),present(1).
[,29] fruit.pods norm(0),diseased(1),few-present(2),dna(3).
[,30] fruit.spots absent(0),col(1),br-w/blk-speck(2),distort(3),dna(4).
[,31] seed norm(0),abnorm(1).
[,32] mold.growth absent(0),present(1).
[,33] seed.discolor absent(0),present(1).
[,34] seed.size norm(0),lt-norm(1).
[,35] shriveling absent(0),present(1).
[,36] roots norm(0),rotted(1),galls-cysts(2).
# reorder columns by number of levels in factors
Soybeans <- Soybean[, c(1,2,7,8,22,23,29,30,4,5,9:11,14:16,19,25,27,36,
3,6,12,13,17,18,20,21,24,26,28,31:35)]
kable(summary(data.frame(Soybeans[,1:2]))) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,3:8])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,9:14])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,15:20])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,21:28])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,29:36])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))par(mar=c(1,1,1,1))
Soybean %>%
gather(-Class, key = "var", value = "val") %>%
ggplot(aes(x = val, fill=Class)) +
geom_bar(alpha=1) +
facet_wrap(~ var, scales = "free") +
scale_fill_manual("target",
values = c('#f1b6da','#c51b7d','#b2df8a','#33a02c','#fb9a99',
'#e31a1c','#fdbf6f','#ff7f00','#cab2d6','#6a3d9a',
'#ffeda0','#dfc27d','#80cdc1','#01665e','#a6cee3',
'#1f78b4','#bf812d','#8c510a','#bababa')) +
xlab("") +
ylab("") +
theme(panel.background = element_blank(), legend.position="top")Methods to deal with zero values while performing log transformation of variable↩
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer.↩